Data driven featurization and modeling

ABSTRACT

Computer-implemented systems and methods are disclosed for data driven expertise mapping. The systems and methods provide for obtaining data sets from data sources, wherein the data sets include services related data, analyzing the data sets, wherein the analysis generates information representative of the services related data, and generating training sets related to the data sets, wherein the training sets are based on known values. The systems and methods further provide for generating models, wherein the models are based on determining services provided by service providers using a combination of the services related data, the analysis of the data sets and the training sets, and provide a mapping of at least one service to service providers. The systems and methods additionally include evaluating the models based on known values and storing an indication for providing to a graphical user interface based on more models.

BACKGROUND

Large datasets exist mapping associations between services and service providers across industries. Professionals in many different industries provide services to individuals. Details about these services are often stored in large data sets with loose associations. Moreover, as different industries become more and more complex, the professionals within those industries begin to focus their training and experience on specific aspects of their industry, complicating existing classification techniques that are too broad to capture specialization. The professionals become skilled in specific, narrow topics at the expense of being less knowledgeable about other areas of the industry.

Determining which service providers offer a specific service or services can often be difficult and confusing. Traditional categorizations of professionals do not accurately indicate their specific expertise or the specific services provided. Most directories of professionals do not provide enough detail to describe the exact skill sets of various professionals in a broad industry.

Information related to provided services is often self-reported by the professional, and professionals will often exaggerate the breadth of their expertise. Some professionals can provide recommendations or referrals, but these can also be unreliable as the professional providing the recommendation or referral may not be knowledgeable about the best colleagues to perform the service. Moreover, it may not always be clear what exact services are required, further complicating the ability to find an appropriate professional. There is an increasing need to provide meaningful insight into the services and expertise of specific professionals within an industry and an increasing need to provide data driven analytical techniques to associate services with service providers.

Over time, more and more data is becoming available describing characteristics of the actual services that have been provided by service providers. But there is a significant gap between the various types of data available and an ability to dynamically process that data in a meaningful way that provides a benefit to those seeking a professional.

Some current systems relying on data use simple counting of the number of times a professional has performed a particular service. But this approach fails if there is incomplete data for other professionals. This type of data may not indicate if the professional has performed the service more than those other professionals for which there is no data. Moreover, this type of analysis provides no indication of the quality of the service provided. Additional systems may rely on consumer provided reviews, but this type of data can often provide too little data to be significant and can often be skewed towards one or two individuals who had a particularly good or bad experience.

BRIEF DESCRIPTION OF THE DRAWINGS

Reference will now be made to the accompanying drawings showing example embodiments of this disclosure. In the drawings:

FIG. 1 is a block diagram of an exemplary computing device, consistent with embodiments of the present disclosure.

FIG. 2 is a block diagram of an exemplary system for data driven expertise mapping, consistent with embodiments of the present disclosure.

FIG. 3 is an exemplary data structure consistent with embodiments of the present disclosure.

FIG. 4 is an exemplary data structure consistent with embodiments of the present disclosure.

FIG. 5 is an exemplary data structure consistent with embodiments of the present disclosure.

FIG. 6 is a flowchart of an exemplary method for data driven expertise mapping, consistent with embodiments of the present disclosure.

DETAILED DESCRIPTION

Reference will now be made in detail to the exemplary embodiments implemented according to the present disclosure, the examples of which are illustrated in the accompanying drawings. Wherever possible, the same reference numbers will be used throughout the drawings to refer to the same or like parts.

Many industries are becoming more and more specialized but are failing to provide an accurate and effective mechanism for matching consumer needs with an appropriate professional. There is currently a need to develop methods for building models that can analyze the available industry specific data in a standard way to provide accurate information about the specific services offered by various professionals in order to allow consumers to easily and accurately find professionals to meet their needs.

The embodiments described herein provide technologies and techniques for evaluating vast amounts and types of data to generate models related to professionals in specific industries. These technologies can extract information from large and varied data sets, transform the data into a consistent format, and utilize various training sets and existing models to generate new models related to predicting professional expertise. The embodiments disclosed further include technologies for providing a graphical user interface and search functionality to match specific user needs with appropriate professionals. In some embodiments, the systems and methods described herein further provide technologies for evaluating data related to the needs of a consumer and automatically recommending appropriate professionals.

The embodiments described herein can include systems and methods for obtaining data associated with one or more professionals, aggregating a subset of the data, analyzing the data wherein the analysis includes information that is representative of the data, and generating one or more models based on the aggregated data and the analysis, wherein the one or more models provide a mapping of at least one expertise to the one or more professionals. The systems and methods further provide for searching for one or more professionals associated with a specified expertise. Additional embodiments include generating training sets related to the data and updating the models based on the training sets.

The embodiments described herein can apply to many fields. Descriptions and applications related to specific domains do not preclude the application of the described embodiments to other technologies or fields.

FIG. 1 is a block diagram of an exemplary computing device 100, consistent with embodiments of the present disclosure. In some embodiments, computing device 100 can be a server providing the functionality described herein. Further, computing device 100 can be a second device providing the functionality described herein or receiving information from a server to provide at least some of that information for display. Moreover, computing device 100 can be an additional device or devices that store and/or provide data consistent with embodiments of the present disclosure.

Computing device 100 can include one or more central processing units (CPUs) 120 and system memory 121. Computing device 100 can also include one or more graphics processing units (GPUs) 125 and graphic memory 126. CPUs 120 can be single or multiple microprocessors, field-programmable gate arrays, or digital signal processors capable of executing sets of instructions stored in a memory (e.g., system memory 121), a cache, or a register. CPUs 120 can contain one or more registers for storing variable types of data including, inter alia, data, instructions, floating point values, conditional values, memory addresses for locations in memory (e.g., system memory 121 or graphic memory 126), pointers and counters. CPU registers can include special purpose registers used to store data associated with executing instructions such as an instruction pointer, instruction counter, and/or memory stack pointer. System memory 121 can include a tangible and/or non-transitory computer-readable medium, such as a flexible disk, a hard disk, a compact disk read-only memory (CD-ROM), magneto-optical (MO) drive, digital versatile disk random-access memory (DVD-RAM), a solid-state disk (SSD), a flash drive and/or flash memory, processor cache, memory register, or a semiconductor memory. System memory 121 can be one or more memory chips capable of storing data and allowing direct access by CPUs 120. System memory 121 can be any type of random access memory (RAM), or other available memory chip capable of operating as described herein.

CPUs 120 can communicate with system memory 121 via a system interface 150, sometimes referred to as a bus. GPUs 125 can be any type of specialized circuitry that can manipulate and alter memory (e.g., graphic memory 126) to provide and/or accelerate the creation of images. GPUs 125 can store images in a frame buffer for output to a display device such as display device 124. GPUs 125 can have a highly parallel structure optimized for processing large, parallel blocks of graphical data more efficiently than general purpose CPUs 120. Furthermore, the functionality of GPUs 125 can be included in a chipset of a special purpose processing unit or a co-processor.

CPUs 120 can execute programming instructions stored in system memory 121 or other memory, operate on data stored in memory (e.g., system memory 121) and communicate with GPUs 125 through the system interface 150, which bridges communication between the various components of computing device 100. In some embodiments, CPUs 120, GPUs 125, system interface 150, or any combination thereof, are integrated into a single chipset or processing unit. GPUs 125 can execute sets of instructions stored in memory (e.g., system memory 121), to manipulate graphical data stored in system memory 121 or graphic memory 126. For example, CPUs 120 can provide instructions to GPUs 125, and GPUs 125 can process the instructions to render graphics data stored in the graphic memory 126. Graphic memory 126 can be any memory space accessible by GPUs 125, including local memory, system memory, on-chip memories, and hard disk. GPUs 125 can enable displaying of graphical data stored in graphic memory 126 on display device 124.

Computing device 100 can include display device 124 and input/output (I/O) devices 130 (e.g., a keyboard, a mouse, or a pointing device) connected to I/O controller 123. I/O controller 123 can communicate with the other components of computing device 100 via system interface 150. It is appreciated that CPUs 120 can also communicate with system memory 121 and other devices in manners other than through system interface 150, such as through serial communication or direct point-to-point communication. Similarly, GPUs 125 can communicate with graphic memory 126 and other devices in ways other than system interface 150. In addition to receiving input, CPUs 120 can provide output via I/O devices 130 (e.g., through a printer, speakers, or other output devices).

Furthermore, computing device 100 can include a network interface 118 to interface to a LAN, WAN, MAN, or the Internet through a variety of connections including, but not limited to, standard telephone lines, LAN or WAN links (e.g., 802.11, T1, T3, 56 kb, X.25), broadband connections (e.g., ISDN, Frame Relay, ATM), wireless connections, or some combination of any or all of the above. Network interface 118 can comprise a built-in network adapter, network interface card, PCMCIA network card, card bus network adapter, wireless network adapter, USB network adapter, modem or any other device suitable for interfacing computing device 100 to any type of network capable of communication and performing the operations described herein.

FIG. 2 is a block diagram representing exemplary system 200 for data driven expertise mapping consistent with embodiments of the present disclosure. System 200 can include data input engine 210 that can further include data extractor 211, data transformer 212, and data loader 213. Data input engine 210 can process data from data sources 201-204. Data input engine 210 can be implemented using computing device 100 from FIG. 1. For example, data from data sources 201-204 can be obtained through I/O devices 130 and/or network interface 118. Further, the data can be stored during processing in a suitable storage such as storage 128 and/or system memory 121.

Data input engine 210 can also interact with data storage 215. Data storage 215 can further be implemented on a computing device such as computing device 100 that stores data in storage 128 and/or system memory 121 as shown in FIG. 1.

System 200 can include aggregation engine 220, pre-computation engine 230, training set generator 240, and model builder 250. Similarly to data input engine 210, aggregation engine 220, pre-computation engine 230, training set generator 240, and model builder 250 can be implemented on a computing device such as computing device 100 from FIG. 1, can utilize storage 128 and/or system memory 121 for storing data, and can utilize I/O device 130 or network interface 118 for transmitting and/or receiving data.

System 200 can further include storage 255, search engine 260, and user interface 270, which can all also be implemented on a computing device such as computing device 100 described in FIG. 1. Each of data input engine 210, data extractor 211, data transformer 212, data loader 213, aggregation engine 220, pre-computation engine 230, training set generator 240, model builder 250, search engine 260, and user interface 270 can be a module, which is a packaged functional hardware unit designed for use with other components or a part of a program that performs a particular function of related functions. Each of these modules can be implemented using computing device 100 of FIG. 1. Each of these components is described in more detail below.

In some embodiments, the functionality of system 200 can be split across multiple computing devices (e.g., multiple devices similar to computing device 100) to allow for distributed processing of the data. In these embodiments the different components can communicate over I/O device 130 or network interface 118.

System 200 can be related to many different domains or fields of use. Descriptions of embodiments related to specific domains, such as healthcare, are not intended to limit the disclosed embodiments to those specific domains, and embodiments consistent with the present disclosure can apply to any domain that utilizes predictive modeling based on available data.

Data input engine 210 is a module that can retrieve data from a variety of data sources (e.g., data source 201, 202, 203, and 204) and process the data so that it can be used with the remainder of system 200. Data input engine 210 can further include data extractor 211, data transformer 212, and data loader 213.

Data extractor 211 acquires and/or retrieves data from data sources 201, 202, 203, and 204. Each of these data sources can represent a different type of data source. For example, data source 201 can be a database. Data source 202 can represent structured data. Data sources 203 and 204 can be flat files. Further, data sources 201-204 can contain overlapping or completely disparate data sets. As an example, data source 201 can contain physician demographic information while data sources 202, 203, and 204 contain various insurance claim, drug claim, and medical treatment data. This data can represent services related data. In a healthcare context, services related data can include, among other things, information about services offered or performed by a physician. For example, data source 201 can contain data structure 300 of FIG. 3, data source 202 can contain data structure 400 of FIG. 4, and data source 203 can contain data structure 500 of FIG. 5. Data extractor 211 can interact with the various data sources, retrieve the relevant data, and provide that data to data transformer 212.

Data transformer 212 can receive data from data extractor 211 and process the data into standard formats. In some embodiments, data transformer 212 can normalize data such as dates. For example, data source 201 can store dates in day-month-year format while data source 202 can store dates in year-month-day format. In this example, data transformer 212 can modify the data provided through data extractor 211 into a consistent date format. Accordingly, data transformer 212 can effectively clean the data provided through data extractor 211 so that all of the data, although originating from a variety of sources, has a consistent format. Data transformer 212 can provide the normalized data to data loader 213.
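As a non-limiting illustration, the date normalization described above can be sketched as follows. The source labels, the format strings, and the function name normalize_date are assumptions introduced only for this example, and ISO 8601 is chosen arbitrarily as the consistent target format.

```python
from datetime import datetime

# Hypothetical mapping from a data source label to the date layout it uses.
# These labels and layouts are assumptions for illustration only.
SOURCE_DATE_FORMATS = {
    "source_201": "%d-%m-%Y",  # day-month-year
    "source_202": "%Y-%m-%d",  # year-month-day
}


def normalize_date(raw: str, source: str) -> str:
    """Parse a date in the source's layout and return it as YYYY-MM-DD."""
    parsed = datetime.strptime(raw, SOURCE_DATE_FORMATS[source])
    return parsed.date().isoformat()
```

Under these assumptions, the same calendar day received from either source yields the same normalized string, which is the property a data loader would rely on when merging records.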

Data loader 213 can receive the normalized data from data transformer 212. Data loader 213 can merge the data into varying formats depending on the specific requirements of system 200 and store the data in an appropriate storage mechanism such as data storage 215. In some embodiments, data storage 215 can be data storage for a distributed data processing system (e.g., Hadoop Distributed File System, Google File System, ClusterFS, and/or OneFS). In some embodiments, data storage 215 can be a relational database (described in more detail below). Depending on the specific embodiment, data loader 213 can optimize the data for storing and processing in data storage 215. In some embodiments, data structures 300, 400, and 500 from FIGS. 3, 4, and 5 (or versions thereof) can be stored by data loader 213 in data storage 215.

Data storage 215 can provide storage and access to data processed by data input engine 210. The data stored in data storage 215 can be provided in a single database or can be distributed over multiple databases that are synchronized or aggregated together to provide a single data store. In some embodiments, data storage 215 is a distributed database management system (DDBMS) (e.g., Apache's Cassandra) that provides multiple databases all containing synchronized data. In this type of system, the components of system 200 can access any of the databases provided as part of data storage 215. Writes and updates to one of the various databases provided by data storage 215 can be synchronized across the databases in the distributed cluster by data storage 215. By distributing the storage across multiple physical database nodes, data storage 215 can provide improved reliability and access to data stored in data storage 215.

Aggregation engine 220, pre-computation engine 230, and model builder 250 can process the data prepared by data input engine 210 and stored in data storage 215. Aggregation engine 220, pre-computation engine 230, and model builder 250 can retrieve data from data storage 215 that has been prepared by data input engine 210. For example, data structures 300, 400, and 500 of FIGS. 3, 4, and 5, each of which is described in more detail before returning to additional description of system 200, can be suitable inputs to aggregation engine 220, pre-computation engine 230, and model builder 250.

As shown in FIG. 3, data structure 300 is an exemplary data structure, consistent with embodiments of the present disclosure. Data structure 300 can store data records associated with professionals. While data structure 300 is shown to store information related to physicians, it is appreciated that it can store information related to any profession. Data structure 300 can, for example, be a database, a flat file, data stored in memory (e.g., system memory 121), and/or data stored in any other suitable storage mechanism (e.g., storage 128).

In some embodiments, data structure 300 can be a Relational Database Management System (RDBMS) (e.g., Oracle Database, Microsoft SQL Server, MySQL, PostgreSQL, and/or IBM DB2). An RDBMS can be designed to efficiently return data for an entire row, or record, in as few operations as possible. An RDBMS can store data by serializing each row of data of data structure 300. For example, in an RDBMS, data associated with record 301 of FIG. 3 can be stored serially such that data associated with all categories of record 301 can be accessed in one operation. Moreover, an RDBMS can efficiently allow access of related records stored in disparate tables. For example, in an RDBMS, data structure 300 of FIG. 3 and data structure 400 (described in more detail below) of FIG. 4 can be linked by a referential column. In this example, professional ID 480 of data structure 400 can directly relate to professional ID 310 of data structure 300. An RDBMS can allow for the efficient retrieval of all records in data structure 400 associated with a record of data structure 300 based on a common value for the respective professional ID fields (e.g., professional ID 480 of data structure 400 and professional ID 310 of data structure 300).
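By way of example only, the referential link between the professional ID fields can be sketched with an in-memory SQLite database. The table names, column names, and row values below are hypothetical and are not taken from the figures; SQLite is used solely because it ships with the Python standard library.

```python
import sqlite3

# In-memory database with two hypothetical tables linked by professional_id.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE professionals (
        professional_id INTEGER PRIMARY KEY,
        specialty TEXT
    );
    CREATE TABLE events (
        event_id INTEGER PRIMARY KEY,
        cost REAL,
        professional_id INTEGER REFERENCES professionals(professional_id)
    );
    """
)
# Illustrative rows only.
conn.executemany(
    "INSERT INTO professionals VALUES (?, ?)",
    [(2, "specialty A"), (3, "specialty B")],
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(1, 100.0, 3), (2, 250.0, 3), (3, 75.0, 2)],
)


def events_for(professional_id: int) -> list:
    """Return (event_id, cost) rows whose professional ID matches the given record."""
    query = (
        "SELECT e.event_id, e.cost "
        "FROM events AS e "
        "JOIN professionals AS p ON e.professional_id = p.professional_id "
        "WHERE p.professional_id = ? "
        "ORDER BY e.event_id"
    )
    return conn.execute(query, (professional_id,)).fetchall()
```

In this sketch, a single parameterized join retrieves every event row that shares a professional ID value with the selected professional row.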

In some embodiments, data structure 300 of FIG. 3 can be a non-relational database system (NRDBMS) (e.g., XML, Cassandra, CouchDB, MongoDB, Oracle NoSQL Database, FoundationDB, and/or Redis). A non-relational database system can store data using a variety of data structures such as, among others, a key-value store, a document store, a graph, and a tuple store. For example, a non-relational database using a document store could combine all of the data associated with a particular professional ID (e.g., professional ID 310 of data structure 300 and professional ID 480 of data structure 400 in FIG. 4) into a single document encoded using XML. In this example, the XML document would include the information stored in record 302 of data structure 300 and record 405 of data structure 400 based on these records sharing the same professional ID value.
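As one possible illustration of the document store example, the sketch below combines a profile record and its related event records into a single XML document keyed by professional ID. The element names, attribute names, and the function build_document are assumptions made for this example and do not reflect any particular database product.

```python
import xml.etree.ElementTree as ET


def build_document(professional_id, profile, events):
    """Combine one profile record and its event records into one XML string."""
    root = ET.Element("professional", id=str(professional_id))
    # Profile fields become child elements of <profile>.
    profile_el = ET.SubElement(root, "profile")
    for field, value in profile.items():
        ET.SubElement(profile_el, field).text = str(value)
    # Each related event becomes an <event> element with its fields as attributes.
    events_el = ET.SubElement(root, "events")
    for event in events:
        ET.SubElement(events_el, "event", {k: str(v) for k, v in event.items()})
    return ET.tostring(root, encoding="unicode")
```

A caller could, for instance, pass the fields of one record of data structure 300 as the profile and the matching records of data structure 400 as the events, so that one lookup by professional ID returns all of the related data.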

Data structure 300 of FIG. 3 can include data records 301-305 representing physicians in addition to countless additional records up to record 399. Data structure 300 can contain many thousands or millions of records of data and is limited only by the physical constraints of the system upon which the data structure exists.

Data structure 300 can include categories of data representing a physician. For example, data structure 300 can include categories professional ID 310, gender 320, age 330, location 340, specialty 350, and affiliations 360. Data associated with data records 301-305 can be stored under each of these categories. For example, a physician represented by data record 301 has a professional ID 310 of “1,” is male as represented by an “M” under gender 320, is 54 as listed under age 330, works in zip code “94403” as represented under location 340, specializes in cardiology as represented under specialty 350, and is affiliated with the Palo Alto Medical Foundation (PAMF) as represented by affiliations 360.

In some embodiments, data structure 300 can contain more or fewer categories for each data record. For example, data structure 300 can include additional categories of data such as certifications, education, publications, or any other category of data associated with a professional. Moreover, depending on the circumstances, data structure 300 can contain domain specific data. For example, in a healthcare context, in addition to healthcare specific specialty 350 and affiliations 360 data, data structure 300 can include insurance coverage information, practice or group name, teaching positions, or other information related to a physician. Accordingly, data structure 300 is not limited to only those categories shown in FIG. 3.

In some embodiments, data structure 300 contains categories that store similar data. For example, data structure 300 can include location 340 that represents a home address zip code, while an additional “location” category (not shown) can be used to store a business zip code.

Additionally, data structure 300 can include combination categories. For example, instead of only using location 340 to represent location information, data structure 300, in some embodiments, includes categories for, among others, street address, state, city, and/or country. This data can be stored under one category or separate categories that, together, represent a location.

Moreover, location 340 can store different types of data. In some embodiments, location 340 is a zip code. In other embodiments, location 340 is a combination category as previously described. Location 340 can further include geospatial coordinates, map coordinates, or any other data type that indicates location.

Similarly to location 340, other categories, such as age 330, specialty 350, and affiliations 360, can include data in a variety of formats. For example, age 330 can be represented in years, in years and months, in days, or by a date of birth.

In some embodiments, data stored under a category can be a reference into another data set or data structure as is common in relational data sets. For example, specialty 350 and affiliations 360 can contain an identifier that references a description stored in a separate data set or lookup table instead of containing text or another data type.
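For purposes of illustration only, a category that stores an identifier rather than text can be resolved against a lookup table as sketched below. The identifier values, the description strings other than cardiology, and the names SPECIALTY_LOOKUP and resolve_specialty are hypothetical.

```python
# Hypothetical lookup table mapping specialty identifiers to descriptions.
SPECIALTY_LOOKUP = {
    10: "Cardiology",
    20: "Dermatology",
}


def resolve_specialty(record: dict) -> str:
    """Return the description for the record's specialty identifier.

    Falls back to "Unknown" when the identifier is not in the lookup table.
    """
    return SPECIALTY_LOOKUP.get(record["specialty_id"], "Unknown")
```

Storing the identifier keeps each record compact and lets a description be corrected in one place, at the cost of one extra lookup when the record is read.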

Additionally, as shown in FIG. 4, data structure 400 is an exemplary data structure, consistent with embodiments of the present disclosure. Data structure 400 can store data records associated with events that are further associated with specific individuals. Similarly to data structure 300 described in FIG. 3, data structure 400 can, for example, be a database, a flat file, data stored in memory (e.g., system memory 121 of computing device 100 from FIG. 1), an RDBMS, an NRDBMS, and/or data stored in any other suitable storage mechanism (e.g., storage 128 of computing device 100 from FIG. 1). Moreover, data structure 400 can be implemented or stored on a computing device similar to computing device 100 described in FIG. 1.

Data structure 400 can store information related to events. Data structure 400 can include data records 401-406 representing data associated with specific events in addition to countless additional records up to record 499. Data structure 400 can contain millions or billions of records of data and is limited only by the physical constraints of the system upon which the data structure exists.

Data structure 400 can include categories of data. For example, data structure 400 can include the categories event ID 410, person ID 420, cost 430, code 1 440, code 2 450, code 3 460, date 470, and professional ID 480. Data associated with data records 401-406 can be stored in each respective row of data structure 400 within one of these categories. For example, an event represented by data record 401 is associated with a person ID 420 of “1,” has a cost 430 of “$8000,” has values of “409,” “10021,” and “R0076,” for code 1 440, code 2 450, and code 3 460, respectively, a date 470 of “Jan. 1, 3010,” and a professional ID 480 of “3.”

Moreover, data structure 400 can include multiple data records associated with the same individual or professional. For example, data records 401-403 all have a value of 1 for person ID 420. Moreover, data records 401-404 all have a value of 3 for professional ID 480. These values can refer to a person ID number or professional ID number stored in a separate data set. For example, professional ID 480 can refer to professional ID 310 of data structure 300 described in FIG. 3. In this example, data records 401-404 of data structure 400 can be associated with data record 303 of data structure 300. Moreover, data record 405 of data structure 400 can be associated with data record 302 of data structure 300 and data record 406 of data structure 400 can be associated with data record 305 of data structure 300 based on the values in professional ID 480 and professional ID 310 of data structure 300 in FIG. 3.

In some embodiments, the data records in data structure 400 are all related to the same type of event or a specific domain. For example, data structure 400 can contain data records related to medical insurance claims. In these embodiments, data structure 400 includes additional categories that are specific to these types of events or domains, such as categories for deductibles. Moreover, in these embodiments, existing categories may contain information related to the domain of the data. For example, in embodiments where data structure 400 includes health insurance claim data, code 1 440, code 2 450, and code 3 460 can represent International Statistical Classification of Diseases and Related Health Problems (ICD) codes, Current Procedural Terminology (CPT) codes, and Healthcare Common Procedure Coding System (HCPCS) codes, respectively. Additionally, these types of codes can represent hierarchical data. Accordingly, a specific code in one of code 1 440, code 2 450, or code 3 460 may imply additional codes or procedures based on the specific classification system in use. In a different domain, code 1 440, code 2 450, and code 3 460 can represent different identifying information for the events represented in data structure 400.

Similarly to data structure 300, data structure 400 can include more or fewer categories for each data record depending on the domain and the source of the data record. Additionally, as described in relation to data structure 300, some categories of data structure 400 can store data in different formats that represent the same concept, such as a date or cost. For example, date 470 can contain only a month and year, or can contain month, day, and year. In a similar example, cost 430 can contain values in terms of United States Dollars or in terms of other currencies.

Additionally, as shown in FIG. 5, data structure 500 is an exemplary data structure, consistent with embodiments of the present disclosure. Data structure 500 can store data records associated with events that are further associated with specific individuals. Similarly to data structure 300 and data structure 400 described in FIGS. 3 and 4, data structure 500 can, for example, be a database, a flat file, data stored in memory (e.g., system memory 121 of computing device 100 from FIG. 1), an RDBMS, an NRDBMS, and/or data stored in any other suitable storage mechanism (e.g., storage 128 of computing device 100 from FIG. 1). Moreover, data structure 500 can be implemented or stored on a computing device similar to computing device 100 described in FIG. 1.

Data structure 500 can store information related to events associated with a product. For example, an event could be the purchase of a product, or in the domain of healthcare, prescription information related to a drug. Data structure 500 can include data records 501-506 representing data associated with a specific event in addition to countless additional records up to record 599. Data structure 500 can contain millions or billions of records of data and is limited only by the physical constraints of the system upon which the data structure exists.

Data structure 500 can include categories of data. For example, data structure 500 can include the categories event ID 510, product ID 520, person ID 530, cost 540, date 550, and professional ID 560. Data associated with data records 501-506 can be stored in each respective row of data structure 500 within one of these categories. For example, an event represented by data record 501 is associated with a product ID 520 of “0573-0133,” person ID 530 of “1,” has a cost 540 of “$4,500,” a date 550 of “Jan. 1, 3010,” and a professional ID 560 of “3.” In this example, product ID 520 can be a reference to the ID for a drug listing in the National Drug Code (NDC) database, and data record 501 can represent a prescription for a medication, such as Advil®. Moreover, data structure 500 can include multiple data records associated with the same individual or professional. For example, data records 501-503 all have a value of “1” for person ID 530. Moreover, data records 501-504 all have a value of “3” for professional ID 560. These values can refer to a person ID number or professional ID number stored in a separate data set.

For example, professional ID 560 can refer to professional ID 310 of data structure 300 described in FIG. 3. In this example, data records 501-504 of data structure 500 can be associated with data record 303 of data structure 300. Moreover, data record 505 of data structure 500 can be associated with data record 302 of data structure 300 based on the values in professional ID 560 and professional ID 310 of data structure 300 in FIG. 3.

In some embodiments, the data records in data structure 500 are all related to the same type of event or a specific domain. For example, data structure 500 can contain data records related to drug prescription claims. In these embodiments, data structure 500 can include additional categories that are specific to these types of events or domains, such as categories for deductibles. Moreover, in these embodiments, existing categories may contain information related to the domain of the data. For example, in embodiments where data structure 500 includes drug prescription claim data, product ID 520 can represent National Drug Codes (NDC) that are part of the National Drug Code Directory.

Similarly to data structure 300 and data structure 400, data structure 500 can include more or fewer categories for each data record depending on the domain and the source of the data record. Additionally, as described in relation to data structures 300 and 400, some categories of data structure 500 can store data in different formats that represent the same concept, such as a date or cost. For example, date 550 can contain only a month and year, or can contain month, day, and year. In a similar example, cost 540 can contain values in terms of United States Dollars or in terms of other currencies.

Referring back to FIG. 2, aggregation engine 220 can retrieve the data stored in data storage 215 (e.g., data structures 300, 400, and 500 from FIGS. 3, 4, and 5), and further aggregate the data into information usable for building models. Aggregation engine 220 can combine data from various tables into a more usable form and can also combine information stored in the same tables that may be indicative of similar events. Additionally, aggregation engine 220 can aggregate only a subset of the available data or data sets depending on the circumstances. In a healthcare context, for example, aggregation engine 220 can combine medical claim data stored in data structure 400 of FIG. 4 with prescription drug data stored in data structure 500 of FIG. 5. For example, by using professional ID 480 and professional ID 560 of data structures 400 and 500, respectively, aggregation engine 220 can combine data records 401-404 of data structure 400 and data records 501-504 of data structure 500 into one data set that is representative of the general behavior of the physician represented by professional ID “3.”
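
The combination of claim and prescription records by a shared professional ID can be sketched as follows. This is a minimal illustration only, not the disclosed implementation; the field names, the record numbers used as sample values, and the function `combine_by_professional` are assumptions introduced for the example.

```python
from collections import defaultdict

# Hypothetical records; field names and values are illustrative only.
claims = [
    {"record": 401, "professional_id": 3},
    {"record": 402, "professional_id": 3},
    {"record": 405, "professional_id": 2},
]
prescriptions = [
    {"record": 501, "professional_id": 3},
    {"record": 505, "professional_id": 2},
]


def combine_by_professional(claims, prescriptions):
    """Group claim and prescription records under their shared professional ID."""
    combined = defaultdict(lambda: {"claims": [], "prescriptions": []})
    for claim in claims:
        combined[claim["professional_id"]]["claims"].append(claim["record"])
    for rx in prescriptions:
        combined[rx["professional_id"]]["prescriptions"].append(rx["record"])
    return dict(combined)


profiles = combine_by_professional(claims, prescriptions)
```

Here `profiles[3]` collects every sample claim and prescription record tied to professional ID 3 into a single per-physician view.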

Additionally, aggregation engine 220 can consider many additional data sets processed by data input engine 210 and stored in data storage 215 beyond those data sets shown in FIGS. 4 and 5. For example, additional data sets may contain drug prescription data that relies on a different coding system than the coding system corresponding to product ID 520 of data structure 500. In this example, aggregation engine 220 can analyze both sets of data and produce a unified, standard data set that includes information from the different forms of data, fully represents the drug prescription data associated with a practitioner, and is in a format usable by other components of system 200.

Aggregation engine 220 can aggregate data stored within the same data set. For example, the NDC numbers used to represent medications can include many product IDs or codes for essentially the same drug. For example, different codes can be used for name brand drugs Advil, Motrin, and the generic drug Ibuprofen even though all of these drugs use the same formula and are used to treat similar conditions or symptoms. Moreover, each of the different dosage possibilities for a drug (e.g., Advil) can be represented by different NDC numbers. Aggregation engine 220 can combine data that includes all of these different variations into a generalized format to represent drug prescriptions associated with a doctor for a pain reliever that uses the formula for ibuprofen. This format can be more useful for analyzing the types of conditions a doctor treats or the types of procedures a physician performs than data divided by specific dosage and name brand. In some embodiments, aggregation engine 220 can rely on information external to the data sets obtained using data input engine 210. These data sets can be stored in data storage 215 and can include, for example, the previously described NDC data.
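
One way to picture this generalization is a lookup from product code to generic ingredient, followed by a count per physician and ingredient. The sketch below is hedged accordingly: the product codes are placeholders rather than real NDC entries, and the table `CODE_TO_INGREDIENT` and the function `ingredient_counts` are hypothetical names used only for illustration.

```python
from collections import Counter

# Hypothetical lookup from product code to generic ingredient.
# The codes are placeholders, not real NDC entries.
CODE_TO_INGREDIENT = {
    "code-brand-a-200mg": "ibuprofen",
    "code-brand-a-400mg": "ibuprofen",
    "code-brand-b-200mg": "ibuprofen",
    "code-generic-200mg": "ibuprofen",
    "code-other-drug": "other ingredient",
}

# Hypothetical prescription records for one physician.
prescriptions = [
    {"professional_id": 3, "product_id": "code-brand-a-200mg"},
    {"professional_id": 3, "product_id": "code-brand-b-200mg"},
    {"professional_id": 3, "product_id": "code-generic-200mg"},
    {"professional_id": 3, "product_id": "code-other-drug"},
]


def ingredient_counts(prescriptions):
    """Count prescriptions per (physician, generic ingredient) pair."""
    counts = Counter()
    for rx in prescriptions:
        ingredient = CODE_TO_INGREDIENT.get(rx["product_id"], "unknown")
        counts[(rx["professional_id"], ingredient)] += 1
    return counts


counts = ingredient_counts(prescriptions)
```

Three records that differ in brand and dosage collapse into a single ibuprofen count for the physician, which is the generalized view described above.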

In some embodiments, aggregation engine 220 can aggregate claims and prescription data based on the associated physician. Aggregation engine 220 can combine all data for a particular physician or create multiple aggregated data sets based on grouping according to specific categories. For example, aggregation engine 220 can aggregate claims data for a particular physician into multiple groups of data based on procedure names, diagnosis codes, and/or date ranges. Data records 401, 402, 403, and 404 of FIG. 4 can all be associated with the same physician (e.g., a physician having a professional ID of 3). If data records 402, 403, and 404 are all associated with the same procedure, aggregation engine 220 can create a grouping that contains only data records 402, 403, and 404 based on their related physician and procedure data. In another example, aggregation engine 220 can create a grouping of data records 401, 402, and 403 based on a date range that includes any claims after 2010. Aggregation engine 220 can combine the available data in multiple ways, creating multiple aggregations or groupings from the same set of data.
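
The two groupings in this example can be sketched as below. The procedure names and dates are invented so that the sample data reproduces the groupings described in the text; they are not taken from FIG. 4, and `group_by` and `after_year` are hypothetical helper names.

```python
from collections import defaultdict
from datetime import date

# Hypothetical claim records; procedures and dates are illustrative only.
claims = [
    {"record": 401, "professional_id": 3, "procedure": "procedure_a", "date": date(2011, 3, 1)},
    {"record": 402, "professional_id": 3, "procedure": "procedure_b", "date": date(2011, 6, 15)},
    {"record": 403, "professional_id": 3, "procedure": "procedure_b", "date": date(2012, 1, 10)},
    {"record": 404, "professional_id": 3, "procedure": "procedure_b", "date": date(2009, 9, 30)},
]


def group_by(claims, key):
    """Group record numbers by (physician, value of the given category)."""
    groups = defaultdict(list)
    for claim in claims:
        groups[(claim["professional_id"], claim[key])].append(claim["record"])
    return dict(groups)


def after_year(claims, year):
    """Return record numbers of claims dated after the given year."""
    return [c["record"] for c in claims if c["date"].year > year]


by_procedure = group_by(claims, "procedure")
recent = after_year(claims, 2010)
```

The same four records yield one grouping by procedure and a different grouping by date range, illustrating multiple aggregations over one data set.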

Referring back to FIG. 2, pre-computation engine 230 can retrieve and process data stored in data storage 215 (e.g., data structures 300, 400, and 500 from FIGS. 3, 4, and 5). In addition to processing the data provided by data input engine 210, pre-computation engine 230 can process data output from aggregation engine 220 and stored in data storage 215. Moreover, pre-computation engine 230 can utilize data external to the data obtained using data input engine 210. This external data can be stored in data storage 215. Pre-computation engine 230 can analyze the various data sets and produce additional analytics and details regarding the data sets. This analysis can be in the form of statistics related to the data sets or in the form of data that represents additional determinations that can be drawn from the data. The analysis can contain any additional information that is representative of the analyzed data. Pre-computation engine 230 can process extremely large sets of data in an efficient manner in order to provide optimized data sets for model builder 250.

In a healthcare context, pre-computation engine 230 can analyze all of the claims data and provide count information based on the number of claims that are available. Each count can correspond to a specific ICD, CPT, HCPCS, and/or NDC code and represent the number of times a physician has used a specific code or prescribed a specific drug.
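
As a rough sketch of such count information, each claim can be reduced to a (physician, coding system, code) tuple and tallied. The code values below are placeholders rather than real ICD, CPT, HCPCS, or NDC entries, and the function name `code_counts` is an assumption made for this example.

```python
from collections import Counter

# Hypothetical (professional_id, coding_system, code) tuples.
# Code values are placeholders, not real ICD, CPT, HCPCS, or NDC entries.
claim_codes = [
    (3, "ICD", "code_x"),
    (3, "ICD", "code_x"),
    (3, "CPT", "code_y"),
    (2, "ICD", "code_x"),
    (3, "NDC", "code_z"),
]


def code_counts(claim_codes):
    """Count how often each physician used each code in each coding system."""
    return Counter(claim_codes)


usage = code_counts(claim_codes)
```

A `Counter` returns zero for unseen combinations, so downstream steps can query any physician and code without special handling.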

Pre-computation engine 230 can also analyze groupings within the various data sets to make additional determinations. For example, pre-computation engine 230 can recognize, using known data stored in, for example, data storage 215 or based on other data sources, that a certain combination of drugs is used to treat a specific disease or is used for a specific procedure. For example, a specific grouping of drugs is used to treat multiple sclerosis (MS). Pre-computation engine 230 can analyze a data set and, if pre-computation engine 230 determines that multiple records of claims and/or drug prescriptions show treatments using, or prescriptions for, the grouping of drugs associated with MS, pre-computation engine 230 can indicate that the physician associated with the records likely treats MS.
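
A minimal sketch of this determination follows, assuming the known drug grouping is available as a simple set and that "multiple records" means at least two. Both the threshold and the contents of `MS_DRUG_GROUP` (apart from Interferon beta 1b, which the specification names elsewhere) are illustrative assumptions, as is the function `likely_treats`.

```python
# Hypothetical knowledge-base entry: ingredients associated with a condition.
# "placeholder_drug" stands in for other members of the grouping.
MS_DRUG_GROUP = {"interferon beta 1b", "placeholder_drug"}

# Hypothetical per-physician prescription ingredients (one entry per record).
prescribed = {
    3: ["interferon beta 1b", "interferon beta 1b", "ibuprofen"],
    2: ["ibuprofen"],
}


def likely_treats(ingredients, condition_drugs, min_records=2):
    """Return True when enough records fall inside the condition's drug grouping.

    The min_records threshold is an assumed, tunable value.
    """
    matches = sum(1 for ingredient in ingredients if ingredient in condition_drugs)
    return matches >= min_records


flags = {pid: likely_treats(drugs, MS_DRUG_GROUP) for pid, drugs in prescribed.items()}
```

The result is an indication rather than a conclusion; in the described system it would be one input among several for model builder 250.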

Pre-computation engine 230 can also use combined code information to make determinations regarding treatment patterns of physicians. Often, a single ICD code can indicate a broad range of conditions or treatments or can be tangentially related to other conditions. For example, physicians that treat obesity and physicians that treat conditions related to insulin production can both use ICD codes for diabetes even though the specific conditions treated are different and the treatments utilized could benefit completely different types of patients. Pre-computation engine 230 can examine additional claims for a patient associated with a physician, as well as alternate coding systems, to determine the specific condition treated by comparing the grouping of claims and codes against known data sets. Accordingly, even though both physicians use the code associated with diabetes, pre-computation engine 230 can use the additional claims and code data to distinguish between the specific conditions treated by each physician.

Moreover, pre-computation engine 230 can produce intermediate models for use by model builder 250. For example, large sets of patient claim data can be analyzed by pre-computation engine 230 to determine common attributes or characteristics that are consistent with specific patient conditions. For example, all of the available patient and claims data can be analyzed to generate intermediate models that predict, based on the claims data and possibly other data sources, patients that suffer from diabetes. This intermediate model can be provided to model builder 250, which can then use the pre-computed or pre-analyzed model to determine which physicians treat those identified patients and diabetes in general. In some embodiments, pre-computation engine 230 identifies treatments associated with patients where procedure codes did not previously provide sufficient specificity.

Pre-computation engine 230 can utilize any available type of data or data set and is not limited to claims data. For example, pre-computation engine 230 can combine prescription drug claims data, patient history or demographic data, patient survey information, or any other type of data to generate or improve intermediate models or data. Pre-computation engine 230 can also use these additional data sets independently to produce intermediate models. Pre-computation engine 230 can produce many intermediate models, data sets, or analyses for use by model builder 250 depending on the available data, the specific domain, or the settings for the particular system.

Model builder 250 can utilize the data stored in data storage 215 to generate models that associate professionals with the specific services they provide. The data stored in data storage 215 can include both the original data provided by data input engine 210 as well as additional data created or compiled by aggregation engine 220 and/or pre-computation engine 230. The data used can be domain specific to the industry being considered and can further include data external to the data obtained using data input engine 210. This external data can be stored in data storage 215. The embodiments described below that discuss model generation in relation to healthcare are not meant to limit the applications of the disclosed embodiments to that particular industry. Healthcare is used as an example of the techniques used to generate models for professional expertise mapping.

In embodiments related to healthcare, model builder 250 can process the available data to produce models that associate physicians with the specific conditions they treat or the specific procedures, treatments, and/or expertise they offer. The models can result in better patient care by allowing patients to find physicians that treat their specific conditions. For example, a patient that requires a stent can utilize the models generated by model builder 250 to find cardiologists, radiologists, interventional radiologists, or other specialists who specifically place stents while avoiding members of those same specialties who possess little or no expertise in placing stents.

Model builder 250 can utilize a number of techniques to build models. Model builder 250 can use data created by aggregation engine 220 to inform the model. For example, as previously described, aggregation engine 220 can combine related drug information into a combined set of the types of drugs prescribed by a particular physician. Additionally, pre-computation engine 230 can associate certain groupings of drugs with a particular treatment or condition. For example, aggregation engine 220 can combine information related to the various dosages and names for different drugs that are used to treat MS. This aggregation can combine both generic and name brand drugs into a unified set. Accordingly, where the data may have initially appeared to indicate that the physician was prescribing unrelated drugs, the data provided to model builder 250 by aggregation engine 220 can better indicate that even though the names, dosages, or NDC numbers for the various drugs are different, they can map to a smaller set of related or similar drugs.

For example, drug prescription data can indicate that a patient received Betaseron® in various dosages over a period of time and then received Extavia® in various dosages for an additional period of time. In some embodiments, aggregation engine 220 can further delineate the information based on specific date ranges or time intervals. Although the differences in brand name and dosages, in this example, could result in multiple different drug records each having different NDC numbers, aggregation engine 220 can provide model builder 250 with a unified data set that demonstrates that each of these records indicates that the condition was treated with Interferon beta 1b, the generic formulation for Betaseron® and Extavia®. Accordingly, model builder 250 can receive information that a condition was treated with Interferon beta 1b, even though that information is based on multiple data records that initially appeared unrelated. Moreover, as previously discussed, pre-computation engine 230 can provide information that drugs, such as Interferon beta 1b, are known to be used to treat a specific condition, such as MS. Model builder 250 can utilize the two types of information provided by aggregation engine 220 and pre-computation engine 230 to determine that the physician's aggregated drug data corresponds to the computed set of drugs that are used to treat MS and to determine that the physician treats MS.

In another example, model builder 250 can use physician demographic information to determine treatment of certain conditions. For example, model builder 250 can identify a physician who treats a particular condition or set of conditions. Model builder 250 can further analyze additional demographic information (e.g., physician demographic information stored in data structure 300 of FIG. 3) and identify physicians who had similar education, fellowships, publications, affiliations, and/or certifications. Additional demographic factors can also be used. In this way, model builder 250 can use demographic information associated with a physician known to treat certain specific conditions, and determine that a physician for whom there is little claim data also treats similar conditions based on demographic information.

Model builder 250 can use existing base models as initial inputs. For example, model builder 250 can utilize basic claim counting. In this example, model builder 250 can determine that a physician performs knee replacements if the physician is associated with a large number of claims associated with knee replacements or a large number of claims associated with knee replacements relative to other physicians with similar practices. Alone, this type of data can lead to both false positives and false negatives. For example, simply counting claims can provide no context about whether or not the count is significant enough to consider the physician to have expertise in knee replacements, potentially resulting in identifying physicians who are not actually experts. Moreover, there may be many physicians who perform knee replacements but, because the claim data can be incomplete, are identified as not performing knee replacements. Although this type of data alone has drawbacks, when combined with other data sources, such as, among others, insurance guides, self-reported specialties, physician demographic information, publications, and/or prescription drug records, it can further inform the model generation.
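
A simple claim-counting base model of the relative kind described here might compare each physician's count against the median of comparable physicians. The sketch below assumes a fixed ratio threshold and uses invented counts; the threshold value and the function name `flag_by_relative_volume` are illustrative assumptions, and, as noted above, such a signal can produce false positives and false negatives when used alone.

```python
from statistics import median

# Hypothetical counts of knee-replacement claims per physician.
knee_claims = {"physician_a": 40, "physician_b": 4, "physician_c": 5, "physician_d": 6}


def flag_by_relative_volume(counts, ratio=3.0):
    """Flag physicians whose claim count is at least `ratio` times the peer median.

    The ratio is an assumed, tunable threshold, not a disclosed value.
    """
    peer_median = median(counts.values())
    return {
        physician
        for physician, n in counts.items()
        if n > 0 and n >= ratio * peer_median
    }


flagged = flag_by_relative_volume(knee_claims)
```

With the sample counts, the peer median is 5.5, so only the physician with 40 claims clears the threshold of 16.5.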

Another example of a basic model used by model builder 250 is specific claim information. For example, although coding systems may include generic codes, some coding systems also contain very specific codes that are associated with a very specific condition. For these types of conditions, physicians using those specific codes can be determined to treat those specific conditions. Codes that provide the necessary level of specificity can vary based on the domain and can be determined empirically. Model builder 250 can be provided with information relating to which codes provide the necessary level of specificity and which codes do not. This information can be stored in, for example, data storage 215. As an example, in a healthcare domain, ICD 9 codes using the abbreviations for Not Elsewhere Classified (NEC) or Not Otherwise Specified (NOS) can be considered too generic to provide a specific condition. Additionally, codes such as E66.9 for “Obesity, unspecified” or S09.02 for “Unspecified injury of the nose” can be determined to be too general to identify a specific condition. But, as an example, ICD 9 codes 201.58 for “Hodgkin's disease, nodular sclerosis, lymph nodes of multiple sites” and 249.7 for “Secondary diabetes mellitus with peripheral circulatory disorders” can be specific enough to identify physicians who treat those specific conditions.
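
The specification states that specificity is determined empirically and provided to model builder 250. As a stand-in for that stored information, the sketch below uses a keyword heuristic on code descriptions; the heuristic and the function name `is_specific` are assumptions made for illustration, and only the descriptions quoted in the text are used as inputs.

```python
import re

# Assumed heuristic: descriptions containing these markers are treated as generic.
_GENERIC_MARKERS = re.compile(r"\b(NEC|NOS|unspecified)\b", re.IGNORECASE)


def is_specific(description):
    """Return True when the description carries none of the generic markers."""
    return _GENERIC_MARKERS.search(description) is None


descriptions = [
    "Obesity, unspecified",
    "Unspecified injury of the nose",
    "Hodgkin's disease, nodular sclerosis, lymph nodes of multiple sites",
    "Secondary diabetes mellitus with peripheral circulatory disorders",
]
specific = [d for d in descriptions if is_specific(d)]
```

Word boundaries in the pattern keep words such as "nose" from being mistaken for the abbreviation NOS.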

Additionally, model builder 250 can rely on known information regarding common claims. For example, if the vast majority of the time, physicians justify treatments for acid reflux using a specific code from one of the various coding systems, claim data utilizing that specific code can indicate that the associated physicians treat acid reflux.

In addition to the data sources already discussed, including data represented by data structures 300 of FIG. 3, 400 of FIG. 4, and 500 of FIG. 5, additional types of data can be used by model builder 250. For example, in a healthcare context, in addition to the medical and pharmacy claims discussed, model builder 250 can rely on data that includes behavioral health claims, academic publications, physician specialties, physician referral patterns, physician board certifications and licenses, physician residency information, and/or physician fellowship training. Moreover, in addition to the methods already discussed above, model builder 250 can utilize various algorithms for generating the models, including probabilistic graphical modeling (e.g., Bayesian networks), association rules (e.g., Apriori, FP-Growth, and/or Eclat), supervised and semi-supervised learning, and/or dimensionality reduction.

In addition to the data sets stored in data storage 215 and the initial base models, which can also be stored in data storage 215, model builder 250 can use training sets provided by training set generator 240 during model creation. Training set generator 240 can use a variety of mechanisms for developing training sets to test a model. In some embodiments related to healthcare, training set generator 240 can use known data about physicians or claims associated with specific treatments or procedures as training data. Training set generator 240 can select a subset of this known information and provide it to model builder 250 to test and refine the generated models.

Training set generator 240 can use additional sources of information to create both positive and negative training sets. In the context of healthcare, medical publications include Medical Subject Headings (MeSH terms) standardized by the National Institutes of Health that can categorize publications as relating to particular treatments or conditions. Training set generator 240 can identify a condition commonly treated by physicians in a certain specialty. For example, training set generator 240 can gather specific terms related to different conditions classified under diabetes. For a given MeSH term, training set generator 240 can then examine physician data to determine which of those physicians publish on diabetes but do not publish on the specific MeSH term associated with the conditions being targeted. Training set generator 240 can use this as a negative training set to compare to the generated model of physicians who treat the conditions associated with the selected MeSH term. Similarly, training set generator 240 can use similar logic to identify physicians who do publish on that specific MeSH term as a positive training set for comparing to the generated models.
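
The positive and negative set construction described here can be sketched with set operations over each physician's publication terms. The physician names and term assignments are invented for the example, the terms are used only as illustrative labels, and `build_training_sets` is a hypothetical function name.

```python
# Hypothetical mapping of physician -> MeSH terms appearing on their publications.
publications = {
    "physician_a": {"Diabetes Mellitus", "Diabetic Neuropathies"},
    "physician_b": {"Diabetes Mellitus"},
    "physician_c": {"Hypertension"},
}


def build_training_sets(publications, broad_terms, target_term):
    """Split physicians into positive and negative sets for one target term.

    Positive: publishes on the target term.
    Negative: publishes on the broader topic but not on the target term.
    Physicians who publish on neither are left out of both sets.
    """
    positive, negative = set(), set()
    for physician, terms in publications.items():
        if target_term in terms:
            positive.add(physician)
        elif terms & broad_terms:
            negative.add(physician)
    return positive, negative


positive, negative = build_training_sets(
    publications,
    broad_terms={"Diabetes Mellitus"},
    target_term="Diabetic Neuropathies",
)
```

Physicians with no publications on the broader topic carry no signal either way, so the sketch excludes them rather than treating them as negatives.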

Training sets can additionally be created based on other types of data and methods. Claims data can be used to create training sets. By using a specific collection of procedures or diagnosis codes that correspond to a specific condition, training set generator 240 can select those physicians associated with a large number of those types of claims as a positive training set. Additionally, information about conditions that physicians treat can be retrieved from the physician's staff or office or crowd-sourced information about treatments a physician has provided. Training set generator 240 can use this information to create additional positive training sets. Moreover, training set generator 240 can analyze information such as board certifications and/or membership within certain societies that have been empirically determined to indicate the treatment of certain conditions to generate additional positive training sets.

As shown, a variety of data sets (e.g., base models based on the number of claims and/or prescriptions, the specificity of coding systems, and commonly used claim data, unaltered data stored in data storage 215, data generated by aggregation engine 220, data generated by pre-computation engine 230, training sets created by training set generator 240, and previous models produced by model builder 250) can be used by model builder 250 to create models. Each data set can be used alone or in combination with any of the other data sets. The specific domain and purpose of the model can determine which data sets are used. For example, a model used to determine physicians who treat MS can, as previously shown, rely on a combination of known treatment patterns stored in data storage 215 or provided by training set generator 240, claims data stored in data storage 215, aggregated data from aggregation engine 220, and computed analytics from pre-computation engine 230, to determine that physicians using codes related to MS and prescribing groups of drugs commonly used to treat MS likely treat MS. Alternatively, when conditions are represented by very specific codes, looking at only claim data can be sufficient to determine physicians who treat that condition. As shown above, claims that include ICD 9 code 201.58 for “Hodgkin's disease, nodular sclerosis, lymph nodes of multiple sites” can provide enough specificity to determine that a physician treats the coded condition.

The choice of which data sources are used by model builder 250 can be based on user configuration, past model generation, or statistical analysis of generated models. Techniques such as Structure Learning or other probabilistic graphical models can be used to select sources. Also, the type of condition being modeled can determine which data sources can be the most useful. This determination can be made based on empirical information, the structure of the model being produced, or statistical analysis, or can be specified by the user of system 200.

Model builder 250 can continue to utilize additional training sets and other data as they are made available to continually refine the generated models. Model builder 250 can store generated models in data storage 255. Data storage 255 can, for example, be a database, a flat file, data stored in memory (e.g., system memory 121 of computing device 100 from FIG. 1), an RDBMS, an NRDBMS, and/or data stored in any other suitable storage mechanism (e.g., storage 128 of computing device 100 from FIG. 1). In some embodiments, data storage 255 can be part of data storage 215.

Search engine 260 can provide access to the information stored in data storage 255. In particular, search engine 260 can provide access to the models created by model builder 250. In embodiments related to a healthcare context, search engine 260 can receive user input (i.e., through user interface 270, described in more detail below) identifying a condition, and search the available models for physicians that treat that specific condition. By using the models generated by model builder 250, the resulting list of physicians can include physicians classified under many different traditional specialties who all treat the specific condition and their corresponding contact information. By providing access to models created by model builder 250, search engine 260 can provide results tailored to the patient's specific needs as opposed to simply providing physicians who are associated with general specialties and who do not actually treat the specified condition.

Search engine 260 can utilize additional filters and processing in order to provide more relevant results. For example, search engine 260 can utilize a natural language processor to translate commonly used phrases or wording into medical terms that would be used by model builder 250. This allows a user of search engine 260 to input a common phrase but receive accurate results based on the underlying medical terms used.

Search engine 260 can utilize geographic filtering to provide results related to a user's specific location. This can be useful in domains, such as healthcare, where a patient can only travel a certain distance to see a physician. Results for physicians outside of a range (predetermined or indicated by the user) can be irrelevant.
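
Geographic filtering of this kind could, for instance, compare great-circle distances against the permitted range. The specification does not name a distance measure; the haversine formula below is one common choice swapped in for illustration, and the coordinates, result records, and function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_range(results, user_lat, user_lon, max_km):
    """Keep only results located within max_km of the user."""
    return [
        r["name"]
        for r in results
        if haversine_km(user_lat, user_lon, r["lat"], r["lon"]) <= max_km
    ]


# Hypothetical search results with illustrative coordinates.
results = [
    {"name": "near_physician", "lat": 40.1, "lon": -75.0},
    {"name": "far_physician", "lat": 42.0, "lon": -75.0},
]
nearby = within_range(results, user_lat=40.0, user_lon=-75.0, max_km=50.0)
```

The `max_km` argument corresponds to the range that is either predetermined or indicated by the user.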

Additionally, search engine 260 can consider user preferences. In embodiments related to healthcare, some patients can have gender or language preferences for physicians that treat them. Search engine 260 can account for these preferences when providing the results. In another example, past history may indicate that a patient requires physicians identified as having certain styles of bedside manner. This type of data can be included in the models generated by model builder 250 and can be used by search engine 260 to provide more relevant results.

Additionally, in some embodiments, search engine 260 can provide results without specific user input. For example, past medical history or past interactions with a system can indicate that an individual needs a physician that treats a specific condition. A system utilizing search engine 260 can proactively generate search results based on the generated models and provide those results in the form of recommendations to the patient without the patient needing to search for a specific condition.

System 200 can further include user interface 270. User interface 270 can be a graphical user interface displayed on a computing device such as computing device 100 of FIG. 1 (e.g., using display device 124). In some embodiments, user interface 270 accepts user input and provides that input to search engine 260. Moreover, user interface 270 can provide the output of search engine 260 on a graphical user interface and allow the user to interact with the results. In some embodiments, the results from search engine 260 can be incorporated into a separate application displayed on user interface 270 that incorporates the data provided by search engine 260.

FIG. 6 is a flowchart of an exemplary method 600 for data driven expertise mapping. It will be readily appreciated that the illustrated procedure can be altered to delete steps or further include additional steps. After initial step 601, the system (e.g., system 200 from FIG. 2) can obtain (step 610) data associated with at least one event (e.g., data stored in data structures 300, 400, and 500 of FIGS. 3, 4 and 5). Obtaining the data can include, as described in relation to FIG. 2, extracting (e.g., using data extractor 211) the data from multiple data sources (e.g., data sources 201-204), transforming the data (e.g., using data transformer 212), and loading the data (e.g., using data loader 213) into a storage location (e.g., data storage 215) for additional analysis. Through this process, the system can prepare data from a variety of sources into a normalized and consistent representation ready for further processing.

The system can further aggregate (step 620) the data (e.g., using aggregation engine 220). As previously shown, the system can identify different data sets that contain similar or related types of information and aggregate that data into common data sets. Moreover, the system can combine related data within the same data set into a unified data set. The system can further generate aggregate information related to the data combined in the data sets. This information can include the number of certain codes referenced by certain professionals or, in a healthcare context, the number of prescriptions for a certain drug or grouping of drugs written by a physician. The system can aggregate all data or only a subset of the available data.

The system can also determine analytics (step 630) or other statistics related to the data (e.g., using pre-computation engine 230). These statistics can be based on the cleaned data (e.g., data stored in data storage 215), the aggregated data (e.g., data provided by aggregation engine 220), or both. The system can analyze the data sets to generate some initial statistics or analysis based on the various types of data. The analysis generated by the system can include information that is representative of the data analyzed. As previously shown, the system can identify sets of codes or other identifiers that commonly refer to the same expertise. In some embodiments related to healthcare, the system can identify groupings of drugs or treatments that are indicative of a specific treatment or condition. This analysis is exemplary, and the system can perform additional analytics to generate statistics or other determinations based on the provided data. The system can store the results of the analysis for later use (e.g., in data storage 215).

The system can generate (step 640) one or more models (e.g., using model builder 250) based on the obtained data (e.g., data stored in data storage 215), aggregated data (e.g., data stored in data storage 215 and generated by aggregation engine 220), statistical or analytical data (e.g., data stored in data storage 215 and generated by pre-computation engine 230), as well as predetermined base models or other data. Using these various data sets, the system can create one or more models adapted to analyze data associated with professionals and determine the professionals' specific expertise. In a healthcare context, the system can identify physicians who treat a specific condition. In some embodiments, these models can identify physicians across specialties who, nevertheless, can treat the same conditions.

In some embodiments, the system updates (step 650) the one or more models using training set data (e.g., training sets provided by training set generator 240). The system can use various methods to create training sets that can be used to inform the model creation. For example, training sets can be generated, as previously described, based on cross-referencing physician treatment codes with MeSH terms associated with physician publications to determine characteristics and behavioral patterns of physicians known to publish on and treat certain specific conditions. Additionally, the system can use a similar analysis to build negative training sets containing physicians who use certain diagnosis codes but do not publish on or treat certain conditions that can be associated with those diagnosis codes. By creating training sets, the system can create models that better identify specific expertise for professionals.

After generating models and/or updating models based on training sets, the system can provide (step 660) the one or more models for searching (e.g., by storing the models in data storage 255, which can be accessible to and used by search engine 260). The system can utilize the generated models to search for specific expertise or needs. In a healthcare context, the system can search the models to identify physicians who treat a specific condition or who perform a specific procedure.
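
One way to picture the hand-off from stored models to search is a small in-memory store keyed by expertise, queried by a ranking function. The sketch below is illustrative only: `ModelStore`, `share_of`, `search`, the `min_score` cutoff, and the sample data are hypothetical stand-ins and are not the interfaces of data storage 255 or search engine 260.

```python
class ModelStore:
    """Hypothetical in-memory stand-in for a model store."""

    def __init__(self):
        self._models = {}

    def put(self, expertise, scorer):
        self._models[expertise] = scorer

    def get(self, expertise):
        return self._models.get(expertise)


def share_of(codes_of_interest):
    """Return a scorer: fraction of activity that uses the given codes."""
    def scorer(code_counts):
        total = sum(code_counts.values())
        if total == 0:
            return 0.0
        hits = sum(n for c, n in code_counts.items() if c in codes_of_interest)
        return hits / total
    return scorer


def search(store, expertise, professionals, min_score=0.5):
    """Rank professionals by the stored model's score, highest first."""
    scorer = store.get(expertise)
    if scorer is None:
        return []  # no model has been provided for this expertise
    scored = [(pid, scorer(codes)) for pid, codes in professionals.items()]
    hits = [(pid, s) for pid, s in scored if s >= min_score]
    return sorted(hits, key=lambda item: (-item[1], item[0]))


store = ModelStore()
store.put("procedure-9", share_of({"px:9"}))

# Hypothetical per-professional code counts.
professionals = {
    "prov-1": {"px:9": 6, "dx:B": 2},
    "prov-2": {"px:9": 1, "dx:B": 9},
    "prov-3": {"px:9": 4},
}

results = search(store, "procedure-9", professionals)
```

Keeping the store separate from the search function means models can be rebuilt or retrained without changing how queries are issued.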

The system can provide (step 670) a user interface (e.g., using user interface 270) that displays the results of the search (e.g., results output by search engine 260). The system can also accept user input to direct the search. The system can provide the results in response to direct input from a user or can provide results pre-emptively based on data known about the user. For example, the system can provide recommendations for treating physicians based on information associated with a patient who has previously provided claim or medical history.

In the foregoing specification, embodiments have been described with reference to numerous specific details that can vary from implementation to implementation. Certain adaptations and modifications of the described embodiments can be made. Other embodiments can be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. It is intended that the specification and examples be considered as exemplary only. Moreover, it is intended that descriptions and examples related to specific domains are not limiting and that the disclosed embodiments can be applied to a variety of different domains. It is also intended that the sequence of steps shown in the figures is only for illustrative purposes and is not intended to be limited to any particular sequence of steps. As such, those skilled in the art can appreciate that these steps can be performed in a different order while implementing the same method.

1.-18. (canceled)
 19. A data-driven dynamic modeling system comprising: one or more memory devices storing processor executable instructions; and one or more processors configured to execute the instructions to cause the data-driven dynamic modeling system to perform: obtaining one or more data records from one or more data sets, wherein at least one of the one or more data records is associated with a professional; aggregating, using an aggregation engine, the one or more data records, based on related attributes of the one or more data records; analyzing, using a pre-computation engine, the one or more combined data records and one or more data records stored in the one or more data sets; computing additional data based on the analysis; generating one or more models, using a model builder, based on the one or more combined data records, the one or more data records, and the computed additional data; training the one or more models using training sets; evaluating the one or more trained models based on known values; and using the one or more models, determining a specialty associated with the professional.
 20. The data-driven dynamic modeling system device of claim 19, wherein: the one or more data records contain domain specific data; and the related attributes of the one or more data records are associated with the domain.
 21. The data-driven dynamic modeling system of claim 20, wherein the domain specific data includes at least one of medical claim data or prescription drug data.
 22. The data-driven dynamic modeling system of claim 20, wherein the domain specific data includes at least one of Current Procedural Terminology (CPT) codes, Healthcare Common Procedure Coding (HCPCS) codes, or International Statistical Classification of Diseases and Related Health Problems (ICD) codes.
 23. The data-driven dynamic modeling system of claim 19, wherein the computed additional data corresponds to the aggregated data records.
 24. The data-driven dynamic modeling system of claim 19, wherein the training sets include at least one of a positive training set identifying professional activities associated with a first specialty or a negative training set identifying professional activities not associated with a second specialty.
 25. The data-driven dynamic modeling system of claim 19, wherein the training sets include publications categorized by medical subject heading (MeSH) terms.
 26. A method performed by one or more processors and comprising: obtaining one or more data records from one or more data sets, wherein at least one of the one or more data records is associated with a professional; aggregating, using an aggregation engine, the one or more data records, based on related attributes of the one or more data records; analyzing, using a pre-computation engine, the one or more combined data records and one or more data records stored in the one or more data sets; computing additional data based on the analysis; generating one or more models, using a model builder, based on the one or more combined data records, the one or more data records, and the computed additional data; training the one or more models using training sets; evaluating the one or more trained models based on known values; and using the one or more models, determining a specialty associated with the professional.
 27. The method of claim 26, wherein: the one or more data records contain domain specific data; and the related attributes of the one or more data records are associated with the domain.
 28. The method of claim 27, wherein the domain specific data includes at least one of medical claim data or prescription drug data.
 29. The method of claim 27, wherein the domain specific data includes at least one of Current Procedural Terminology (CPT) codes, Healthcare Common Procedure Coding (HCPCS) codes, or International Statistical Classification of Diseases and Related Health Problems (ICD) codes.
 30. The method of claim 26, wherein the training sets include at least one of a positive training set identifying professional activities associated with a first specialty or a negative training set identifying professional activities not associated with a second specialty.
 31. The method of claim 26, wherein the training sets include domain specific information.
 32. A non-transitory computer readable storage medium storing instructions that are executable by a first computing device that includes one or more processors to cause the first computing device to perform a method for data driven expertise mapping, the method comprising: obtaining one or more data records from one or more data sets, wherein at least one of the one or more data records is associated with a professional; aggregating, using an aggregation engine, the one or more data records, based on related attributes of the one or more data records; analyzing, using a pre-computation engine, the one or more combined data records and one or more data records stored in the one or more data sets; computing additional data based on the analysis; generating one or more models, using a model builder, based on the one or more combined data records, the one or more data records, and the computed additional data; training the one or more models using training sets; evaluating the one or more trained models based on known values; and using the one or more models, determining a specialty associated with the professional.
 33. The non-transitory computer readable medium of claim 32, wherein: the one or more data records contain domain specific data; and the related attributes of the one or more data records are associated with the domain.
 34. The non-transitory computer readable medium of claim 33, wherein the domain specific data includes at least one of medical claim data or prescription drug data.
 35. The non-transitory computer readable medium of claim 33, wherein the domain specific data includes at least one of Current Procedural Terminology (CPT) codes, Healthcare Common Procedure Coding (HCPCS) codes, or International Statistical Classification of Diseases and Related Health Problems (ICD) codes.
 36. The non-transitory computer readable medium of claim 32, wherein the computed additional data corresponds to the aggregated data records.
 37. The non-transitory computer readable medium of claim 32, wherein the training sets include at least one of a positive training set identifying professional activities associated with a first specialty or a negative training set identifying professional activities not associated with a first specialty.
 38. The non-transitory computer readable medium of claim 32, wherein the training sets include domain specific information, the domain specific information including at least publications categorized by medical subject heading (MeSH) terms.